-
Notifications
You must be signed in to change notification settings - Fork 19
Refactor: ingester changes #1574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
|
|
||
| def build_instances_from_submission(data: dict[str, Any]) -> dict[TableNames, list]: | ||
| def build_instances_from_submission(data: dict[str, Any]) -> ProcessedSubmission: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to avoid using Any? Maybe with generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this data is the submission json, so the Any means any value that an item can have: str, int, list, dict, none... IMO in this case it is ok to leave it as Any (and other similar cases around these functions too). This use will at least define that they dict keys are str, otherwise we would be using just dict which just means dict[Any, Any]
| STORAGE_BASE_URL = os.environ.get( | ||
| "STORAGE_BASE_URL", "https://files-staging.kernelci.org" | ||
| ) | ||
| CONVERT_LOG_EXCERPT = False # If True, convert log_excerpt to output_files url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not an env as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything was defined statically in the file, I just added verbose as an env var. I'll make the new constants file and I can make them all env vars 👍
| ) | ||
| CONVERT_LOG_EXCERPT = False # If True, convert log_excerpt to output_files url | ||
|
|
||
| TREES_FILE = "/app/trees.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think a new file:
constants/ingester.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
| if VERBOSE: | ||
| logger.info("Uploading logexcerpt for %s to %s", id, upload_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idk this VERBOSE variable could be more descriptive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the description for the VERBOSE variable or the string in this specific logger.info?
| str: On success upload: the reference url. On failed upload: the original logexcerpt | ||
| """ | ||
|
|
||
| upload_url = f"{STORAGE_BASE_URL}/upload" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outside of function
fa7db92 to
1255402
Compare
| """Write debug/perf output to stdout. Logger was unreliable in some environments.""" | ||
| try: | ||
| print( | ||
| f"[{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())}] {msg}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nested double quotes on f-string
| if os.path.isfile(os.path.join(spool_dir, f)) and f.endswith(".json") | ||
| ] | ||
| if not json_files: | ||
| return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to return like this instead of tuple? I think we should return empty list and 0 files
return [], 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this simple return should be for the ingest_submissions_parallel, so I returned the [], 0 here and added another condition there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed with the rebase, not needed anymore
| ) | ||
| ) | ||
| json_files, total_bytes = get_spool_files(spool_dir) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not json_files: maybe? i there is no spool files, we might need to put small delay and check again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I had moved the if not json_files to the other function but it should be here. I'll fix it
| """ | ||
| Consumes the list of objects and tries to insert them into the database. | ||
| """ | ||
| total = len(issues_buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think should be something like:
total = (
len(issues_buf)
+ len(checkouts_buf)
+ len(builds_buf)
+ len(tests_buf)
+ len(incidents_buf)
)
| from kernelCI_app.models import Builds, Checkouts, Incidents, Issues, Tests | ||
|
|
||
| type TableNames = Literal["issues", "checkouts", "builds", "tests", "incidents"] | ||
| type TableModels = Issues | Checkouts | Builds | Tests | Incidents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure in this part...
TableModels = Issues | Checkouts | Builds | Tests | Incidents denotes instances, but MODEL_MAP stores classes and consume_buffer expects a class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it denote instances? IIUC it denotes the classes themselves, hovering over the TableModels variable I see (type) TableModels = type[Issues] | type[Checkouts]... directly
| INGEST_FLUSH_TIMEOUT_SEC = float(os.environ.get("INGEST_FLUSH_TIMEOUT_SEC", 2.0)) | ||
| except (ValueError, TypeError): | ||
| logger.warning("Invalid INGEST_FLUSH_TIMEOUT_SEC, using default 2.0") | ||
| INGEST_FLUSH_TIMEOUT_SEC = 5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If default is 2.0, we must set 2.0 here too
| from typing import Literal | ||
| from kernelCI_app.models import Builds, Checkouts, Incidents, Issues, Tests | ||
|
|
||
| type TableNames = Literal["issues", "checkouts", "builds", "tests", "incidents"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This require Python >=3.12 (PEP 695). Do we use 3.12 and above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do, it's set on pypoetry.toml and we have been using this format for a while
acc8bd1 to
40ba06b
Compare
Even with this reduction, it's still hard to lower the complexity even further without creating idempotent functions Closes kernelci#1571
40ba06b to
6dab5dc
Compare
Changes
Refactors ingester code:
How to test
The behavior should be the same as before. So:
backend/kernelCI_app/management/commands/tmp_submissions(you can change the path, just update the command below)/backend/volume_dataas a result of thetreeproofcommand. If you are feeling lazy and you don't want to run the command, there's an example file here (Google Drive link)USE_DASHBOARD_DB=Truepoetry run python3 ./manage.py monitor_submissions --spool-dir kernelCI_app/management/commands/tmp_submissions --trees-file volume_data/trees-name.yaml(being in the/backenddirectory, update the paths if necessary)You can also test the ingester on docker, make sure the environment variables are on
.env.backendand usedocker compose run --rm backendbefore running the command shown above.Closes #1571